{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Manipulating Metadata\n", "\n", "One of ParquetDB’s strengths is the ability to store and manage **metadata** alongside your dataset. You can attach metadata at:\n", "- **Dataset level** (e.g., `version`, `source`), which applies to the entire table or dataset.\n", "- **Field/column level** (e.g., `units`, `description`), which applies to specific columns.\n", "\n", "In this notebook, we’ll walk through:\n", "1. **Updating the Schema** – how to add or change fields in the dataset schema, including updating metadata.\n", "2. **Setting Dataset Metadata** – how to set or update top-level metadata for the entire dataset.\n", "3. **Setting Field Metadata** – how to set or update metadata for individual fields (columns).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The `update_schema` method allows you to modify the structure and metadata of your dataset. You can:\n", "- Change the data type of an existing field.\n", "- Add new fields (if your workflow demands it).\n", "- Update the **top-level** metadata, merging it with the existing entries when `update_metadata=True`.\n", "- Optionally normalize the dataset after making schema changes by providing a `normalize_config`.\n", "\n", "\n", "```python\n", "def update_schema(\n", "    self,\n", "    field_dict: dict = None,\n", "    schema: pa.Schema = None,\n", "    update_metadata: bool = True,\n", "    normalize_config: NormalizeConfig = NormalizeConfig()\n", "):\n", "    ...\n", "```\n", "- `field_dict`: A dictionary of field updates, where keys are field names and values are either a new data type (e.g., `pa.int32()`, `pa.float64()`) or a full field definition such as `pa.field(\"field_name\", pa.int32())`.\n", "- `schema`: A fully defined PyArrow Schema object to replace or merge with the existing one.\n", "- `update_metadata`: If True, merges the new schema’s metadata with existing metadata. 
If False, replaces the metadata entirely.\n", "- `normalize_config`: A NormalizeConfig object for controlling file distribution after the schema update." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "============================================================\n", "PARQUETDB SUMMARY\n", "============================================================\n", "Database path: my_dataset\n", "\n", "• Number of columns: 3\n", "• Number of rows: 3\n", "• Number of files: 1\n", "• Number of rows per file: [3]\n", "• Number of row groups per file: [1]\n", "• Serialized metadata size per file: [717] Bytes\n", "\n", "############################################################\n", "METADATA\n", "############################################################\n", "\n", "############################################################\n", "COLUMN DETAILS\n", "############################################################\n", "• Columns:\n", " - age\n", " - id\n", " - name\n", "\n" ] } ], "source": [ "from parquetdb import ParquetDB\n", "from pathlib import Path\n", "import shutil\n", "import pyarrow as pa\n", "\n", "ROOT_DIR = Path(\".\")\n", "# Directory name chosen to match the \"Database path\" in the summary output\n", "db_path = ROOT_DIR / \"my_dataset\"\n", "\n", "# Start from a clean dataset so re-running the notebook gives the same result\n", "if db_path.exists():\n", "    shutil.rmtree(db_path)\n", "\n", "# Create the database handle used throughout the rest of the notebook\n", "db = ParquetDB(db_path)\n", "\n", "data = [\n", "    {\"name\": \"Alice\", \"age\": 30},\n", "    {\"name\": \"Bob\", \"age\": 25},\n", "    {\"name\": \"Charlie\", \"age\": 35},\n", "]\n", "\n", "db.create(data)\n", "print(db)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Update Schema" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pyarrow.Table\n", "age: int64\n", "id: int64\n", "name: string\n", "----\n", "age: [[30,25,35]]\n", "id: [[0,1,2]]\n", "name: [[\"Alice\",\"Bob\",\"Charlie\"]]\n", "pyarrow.Table\n", "age: double\n", "id: int64\n", 
"name: string\n", "----\n", "age: [[30,25,35]]\n", "id: [[0,1,2]]\n", "name: [[\"Alice\",\"Bob\",\"Charlie\"]]\n" ] } ], "source": [ "table = db.read()\n", "print(table)\n", "\n", "# Suppose we want to change the 'age' field to float64\n", "field_updates = {\n", " \"age\": pa.field(\n", " \"age\", pa.float64()\n", " ) # or simply pa.float64() if your internal method accepts that\n", "}\n", "\n", "db.update_schema(field_dict=field_updates, update_metadata=True)\n", "\n", "table = db.read()\n", "print(table)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting Dataset Metadata" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "============================================================\n", "PARQUETDB SUMMARY\n", "============================================================\n", "Database path: my_dataset\n", "\n", "• Number of columns: 3\n", "• Number of rows: 3\n", "• Number of files: 1\n", "• Number of rows per file: [3]\n", "• Number of row groups per file: [1]\n", "• Serialized metadata size per file: [854] Bytes\n", "\n", "############################################################\n", "METADATA\n", "############################################################\n", "• source: API\n", "• version: 1.0\n", "\n", "############################################################\n", "COLUMN DETAILS\n", "############################################################\n", "• Columns:\n", " - age\n", " - id\n", " - name\n", "\n" ] } ], "source": [ "# Set dataset-level metadata, merging with existing entries\n", "db.set_metadata({\"source\": \"API\", \"version\": \"1.0\"})\n", "\n", "print(db)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we call `set_metadata` again with additional keys:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'source': 'API', 'version': 
'1.0', 'author': 'Data Engineer', 'department': 'Analytics'}\n" ] } ], "source": [ "# Add more metadata, merging with the existing entries\n", "db.set_metadata({\"author\": \"Data Engineer\", \"department\": \"Analytics\"})\n", "\n", "print(db.get_metadata())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To replace the existing metadata instead of merging, pass `update=False`:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'source': 'API_2', 'version': '2.0'}\n" ] } ], "source": [ "# Replace the existing metadata entirely instead of merging\n", "db.set_metadata({\"source\": \"API_2\", \"version\": \"2.0\"}, update=False)\n", "\n", "print(db.get_metadata())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting Field-Level Metadata\n", "\n", "To attach descriptive information to specific fields (columns), use `set_field_metadata`. It takes a dictionary that maps each field name to its metadata, which is useful for storing **units of measurement**, **data lineage**, or other column-specific properties." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'age': {'units': 'Years', 'description': 'Age of the person'}, 'id': {}, 'name': {}}\n" ] } ], "source": [ "# Map each field name to the metadata to attach to that field\n", "field_meta = {\"age\": {\"units\": \"Years\", \"description\": \"Age of the person\"}}\n", "\n", "db.set_field_metadata(field_meta)\n", "\n", "print(db.get_field_metadata())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Note**: On disk, metadata is typically stored in the **Parquet file footer** and read by PyArrow when the file is loaded. If your analysis relies on certain metadata keys, make sure every step of your workflow updates and preserves them." 
] } ], "metadata": { "kernelspec": { "display_name": "parquetdb_dev", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.20" } }, "nbformat": 4, "nbformat_minor": 2 }